1004.068 



SAR 14754 




VIDEO REGISTRATION BASED ON LOCAL PREDICTION ERRORS 

Cross-Reference to Related Applications 

This application claims the benefit of the filing date of U.S. provisional application no.
60/452,153, filed on 03/05/03, as attorney docket no. SAR 14754.

Statement Regarding Federally Sponsored Research or Development 

The Government of the United States of America has rights in this invention pursuant to the U.S.
Department of Commerce, National Institute of Standards and Technology, Advanced Technology
Program, Cooperative Agreement Number 70NANB1H3036.

BACKGROUND OF THE INVENTION 

Field of the Invention 

The present invention relates to video processing, and, in particular, to the registration of video
sequences.

Description of the Related Art 

Video registration refers to the process of identifying the (e.g., temporal, spatial, and/or histogram)
correspondence between two video sequences, e.g., an original video sequence and a processed video
sequence generated from the original video sequence.

For many applications, such as watermark detection and reference-based video quality
measurement, a processed video sequence may need to be registered to the original sequence. For
example, to detect watermarks embedded in pirated videos that are shot using a camcorder, the processed
video may need to be registered to the original one displayed in the theater. Another area where video
registration is typically needed is in reference-based video quality measurement. To ensure quality of
service (QoS), it is often necessary to measure the quality degradation between the original video and the
one received by a client. The received video is often a processed version of the original video. Therefore,
to achieve a meaningful reference-based quality measurement, the received video is first registered with
respect to the original video sequence.

Differences between a processed video and the original one often result from a combination of
spatial misalignment, temporal misalignment, and histogram misalignment. Spatial misalignment is the
result of spatial manipulation of a video sequence, such as warping, cropping, and resizing (e.g., capturing
a movie with 2.35:1 aspect ratio using a camcorder with 4:3 aspect ratio). The main causes of temporal
misalignment are (1) the change of temporal resolution, such as frame rate conversion (e.g., 3-2 pull
down), and (2) the dropping and/or repeating of frames used by video compression algorithms (e.g.,
MPEG-4). The video capturing process also causes temporal misalignment, because the displaying and the
capturing generally (1) are not synchronized and (2) operate at different frame rates. In addition, processed
videos in general have different color histograms from the original videos. This is often the result of video
processing, such as compression, filtering, or gamma changes. It can also be the result of white balance or
automatic gain control (AGC) in camcorder capture.

Spatial, temporal, and histogram registration can be used to correct the three types of
misalignment. Spatial registration and histogram registration have been studied by many researchers.
However, few studies have been done on temporal registration. A temporal registration scheme for video
quality measurement was proposed in Jiuhuai Lu, "Image analysis for video artifact estimation and
measurement," Proc. of SPIE Machine Vision Applications in Industrial Inspection, v. 4301, pp. 166-174,
San Jose, CA, January 2001, the teachings of which are incorporated herein by reference. This scheme can
recover the global offset between two video sequences. The global offset is estimated by maximizing the
normalized correlation between temporal activity signatures extracted from each sequence. Caspi and Irani
use a direct search for recovering sequence-level temporal misalignment, such as fixed shift or fixed frame
rate conversion. See Y. Caspi and M. Irani, "Alignment of non-overlapping sequences," Proc. of IEEE
Int'l Conf. on Computer Vision, Vancouver, BC, Canada, July 2001.

SUMMARY OF THE INVENTION 
Limitations in the prior art are addressed in accordance with the principles of the present invention
by a temporal registration algorithm for video sequences that, in one embodiment, formulates the temporal
registration problem as a frame-level constrained minimization of a matching cost and solves the problem
using dynamic programming. The algorithm is developed based on a frame-level model of the temporal
misalignment often introduced by video processing algorithms, such as compression, frame-rate
conversion, or video capturing. One advantage of the present invention is that it can be generalized to
incorporate spatial and/or histogram registration. Depending on the application, accurate detection of
frame-level misalignments enables corrections to be made to compensate for those misalignments.

A registration algorithm of the present invention can detect temporal misalignment at the sub-sequence
(e.g., frame) level instead of only at the sequence level, as in the prior art. Therefore, it can
recover from a much wider range of temporal misalignments, such as frame dropping or repeating. In
addition, temporal registration can be combined with spatial and/or histogram registration to recover from
spatial misalignments (e.g., changes of image size during capturing) and/or histogram misalignments (e.g.,
resulting from automatic gain control during capturing). The algorithm not only allows the registration to be
performed according to the video data, but also allows the integration of prior knowledge of what the
registration should be in the form of contextual cost. Therefore, further improvement of both the accuracy
and the robustness of the algorithm is possible. The contextual cost can also be adjusted according to the
application by using domain-specific contextual information.

According to one embodiment, the present invention is a method for identifying correspondence
between an original video sequence comprising a plurality of original frames and a processed video
sequence comprising a plurality of processed frames. The processed video sequence is divided into a
plurality of processed sets, each processed set having one or more processed frames. For each processed
set, one or more original sets from the original video sequence are identified, wherein each original set
comprises one or more original frames, and two or more original sets are identified for at least one
processed set. A mapping is generated for each original set corresponding to each processed set, wherein
the mapping defines, for the original set, a mapped set that approximates the corresponding processed set,
and the mapping minimizes a local prediction error between the mapped set and the corresponding
processed set. For each processed set, the original set whose mapping minimizes an accumulated
prediction error for the processed video sequence is selected.

BRIEF DESCRIPTION OF THE DRAWINGS 
Other aspects, features, and advantages of the present invention will become more fully apparent
from the following detailed description, the appended claims, and the accompanying drawings in which
like reference numerals identify similar or identical elements.

Fig. 1 illustrates an exemplary video process in which a processed video sequence is generated
from an original video sequence;

Fig. 2 shows a grid that can be used to determine temporal registration using dynamic
programming;

Fig. 3 graphically illustrates temporal misalignment resulting from differences in both frame rate
and starting time between original and processed video sequences;

Fig. 4 shows a plot of one possible state transition cost as a function of the difference between the
matching indices for two consecutive processed frames, which function enforces contextual constraints;
and

Fig. 5 shows a flow diagram of a solution to the registration problem using dynamic programming,
according to one possible embodiment of the present invention.

DETAILED DESCRIPTION 
As used in this specification, the term "video sequence" refers to a set of one or more consecutive
video frames (or fields). In the context of video registration, the term "video sequence" may be considered
to be synonymous with "processing window" and "registration window." Depending on the particular
implementation details for the video registration application, a video stream may be treated as a single
video sequence (i.e., one window) or divided into a plurality of consecutive video sequences (i.e., multiple
windows).

Frame-Level Model of Processed Video Sequences

This section discusses a model for processed video at the frame level. One application for this 
model is to characterize the video process, where a video displayed in a movie theatre has been captured 
using a camcorder. 

Original video frames are denoted as $I_i$ and processed video frames as $J_j$, where $i$ is the frame
index of the original video sequence, $0 \le i < N$, and $j$ is the frame index of the processed video
sequence, $0 \le j < M$. A processed frame $J_j$ can be modeled according to Equation (1) as follows:

$$J_j = \varphi_{k(j)}\left[I_{\alpha(j)-\beta(k(j))+1}, \ldots, I_{\alpha(j)}\right], \qquad (1)$$

where $\varphi_{k(j)}$ is one of $K$ possible mapping functions indexed by $k(j)$. Each of the $K$ mapping
functions maps a set of one or more frames of the original video to one frame of the processed video. The
number of frames needed for the $k$-th mapping function is denoted by $\beta(k)$. The matching index
$\alpha(j)$ is the largest index in the set of original frames that maps to the processed frame $J_j$.

Fig. 1 illustrates an exemplary video process in which a processed video sequence 104 is generated
from an original video sequence 102. As shown in Fig. 1, processed frame 0 is generated from original
frame 1, processed frames 1 and 2 are both generated from original frame 3, and processed frame 3 is
generated from original frames 6 and 7.

Two mapping functions, $\varphi_1$ and $\varphi_2$, are used in the example. The numbers of input frames
for $\varphi_1$ and $\varphi_2$ are $\beta(1) = 1$ and $\beta(2) = 2$, respectively. With matching index
$\alpha(0) = 1$, $\varphi_1$ maps original frame 1 to processed frame 0. On the other hand, with matching
index $\alpha(3) = 7$, original frames 6 and 7 are mapped to processed frame 3 by mapping operation
$\varphi_2$.

The model of Equation (1) is a general frame-level video processing model. It can be applied not
only to captured video, but also to other frame-level processing including some widely used temporal
manipulations, such as frame skipping and frame repeating. For example, as shown in Fig. 1, original
frames 2, 4 and 5 are not associated with any processed frame. Therefore, they are skipped frames. In
addition, processed frames 1 and 2 are the same, having both been generated from original frame 3. As
such, they provide an example of frame repeating.
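The Fig. 1 example can be written down as data to make the roles of the matching index and the frame count concrete. The following is only a sketch: the names `FIG1_MODEL`, `source_frames`, `skipped_frames`, and `repeated_frames` are hypothetical, and each processed frame is encoded as an assumed pair (beta, alpha), i.e., the number of original frames consumed and the matching index.

```python
# Hypothetical encoding of the Fig. 1 example. Entry j describes processed
# frame j as (beta, alpha): beta original frames are used, ending at index alpha.
FIG1_MODEL = [(1, 1), (1, 3), (1, 3), (2, 7)]  # processed frames 0..3


def source_frames(beta, alpha):
    """Original frame indices alpha-beta+1, ..., alpha used by one mapping."""
    return list(range(alpha - beta + 1, alpha + 1))


def skipped_frames(model):
    """Original frames between the first and last used index that feed no processed frame."""
    used = {i for beta, alpha in model for i in source_frames(beta, alpha)}
    return [i for i in range(min(used), max(used) + 1) if i not in used]


def repeated_frames(model):
    """Processed frame indices whose mapping is identical to that of the previous frame."""
    return [j for j in range(1, len(model)) if model[j] == model[j - 1]]
```

With this encoding, `skipped_frames(FIG1_MODEL)` lists original frames 2, 4, and 5, and `repeated_frames(FIG1_MODEL)` flags processed frame 2 as a repeat of processed frame 1, matching the description of Fig. 1.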

With the notations defined above, the video registration problem becomes: Given original video
frames $I_i$ and processed video frames $J_j$, estimate the mapping functions $\varphi_{k(j)}$ and matching
indices $\alpha(j)$. One way to estimate $\varphi_{k(j)}$ and $\alpha(j)$ is to minimize the distortion between
the processed video frames $J_j$ and the model predictions from the original sequence over all possible
combinations of $k(j)$ and $\alpha(j)$ for all $j$.

The minimization problem may be subject to a causal constraint on $\alpha(j)$. That is, no frames
displayed in the past can be processed at the present time. Formally, this constraint can be expressed as:
For any given $j_1$ and $j_2$, if $j_1 < j_2$, then $\alpha(j_1) \le \alpha(j_2)$. In addition, although other
suitable measures of distortion may be used, the following discussion assumes the use of the mean squared
error (MSE) as the distortion measure. Therefore, the registration
$[k^*(0), \alpha^*(0), \ldots, k^*(M), \alpha^*(M)]$ determined by minimizing the distortion in light of the
causal constraint can be computed by Equation (2) as follows:

$$[k^*(0), \alpha^*(0), \ldots, k^*(M), \alpha^*(M)] = \underset{[k(0),\ldots,k(M),\ \alpha(0)\le\cdots\le\alpha(M)]}{\arg\min} \sum_{j=0}^{M} \left\| J_j - \varphi_{k(j)}\left[I_{\alpha(j)-\beta(k(j))+1}, \ldots, I_{\alpha(j)}\right] \right\|^2 \qquad (2)$$

Since $k(j)$ for $j = 0, \ldots, M$ can be optimized independently and there is only causal dependency among
$\alpha(j)$, the optimization defined in Equation (2) can be solved using dynamic programming.

Fig. 2 shows a grid that can be used to determine temporal registration using dynamic
programming. In particular, the horizontal axis represents the different frames in the processed video
sequence from 0 to M, while the vertical axis represents all possible values of the corresponding matching
index $\alpha(j)$, i.e., the index of the highest-index original frame used to generate processed frame $J_j$.

To solve Equation (2) using dynamic programming, the minimization of Equation (2) may be
partitioned into stages according to the index $j$ of the processed frame $J_j$. The state for each stage is
$\alpha(j)$, also denoted as $i$. In the grid of Fig. 2 defined by stages and states, a path (e.g., 202), defined
as a mapping from stages to states, is denoted as $\alpha(j)$. The cost function to be minimized in Equation (2)
is the minimal accumulated mean squared error over all mapping functions from stage 0 to stage M along any
feasible path.

However, because of the causal constraint $\alpha(0) \le \alpha(1) \le \cdots \le \alpha(M)$, a path in Fig. 2
is a feasible path only if it is monotonically increasing in value (i.e., non-decreasing). Fig. 2 shows a feasible
path 202 from stage 0 to stage M. Fig. 2 also shows all feasible paths (as dashed lines) that pass through a
particular grid point 204. The solution to Equation (2) is the monotonically increasing path from stage 0 to
stage M that has the minimal accumulated mean squared error.

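If $\delta(M)$ denotes the minimal accumulated MSE over a feasible path from stage 0 to stage M, then:

$$\delta(M) = \min_{[k(0),\ldots,k(M),\ \alpha(0)\le\cdots\le\alpha(M)]} \sum_{j=0}^{M} \left\| J_j - \varphi_{k(j)}\left[I_{\alpha(j)-\beta(k(j))+1}, \ldots, I_{\alpha(j)}\right] \right\|^2$$

$$= \min_{\alpha(M)} \left\{ \min_{k(M)} \left\| J_M - \varphi_{k(M)}\left[I_{\alpha(M)-\beta(k(M))+1}, \ldots, I_{\alpha(M)}\right] \right\|^2 + \min_{[k(0),\ldots,k(M-1),\ \alpha(0)\le\cdots\le\alpha(M-1)\le\alpha(M)]} \sum_{j=0}^{M-1} \left\| J_j - \varphi_{k(j)}\left[I_{\alpha(j)-\beta(k(j))+1}, \ldots, I_{\alpha(j)}\right] \right\|^2 \right\} \qquad (3)$$

Therefore, the dynamic programming can be summarized as the following three steps:

(1) Compute the minimal mean squared error at each node $(j, i)$:

$$k^*(j)\big|_{\alpha(j)=i} = \underset{k(j)}{\arg\min} \left\| J_j - \varphi_{k(j)}\left[I_{i-\beta(k(j))+1}, \ldots, I_i\right] \right\|^2 \qquad (4)$$

(2) Recursively, as shown in Equation (3), compute $\delta(j)$, for $j = 0, 1, \ldots, M$.

(3) After the minimal accumulated MSE for the last stage ($\delta(M)$) is calculated, back trace to
compute $[k^*(0), \alpha^*(0), \ldots, k^*(M), \alpha^*(M)]$.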
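The three steps above can be sketched as a small dynamic program. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `register_temporal` is hypothetical, and the input `cost[j][i]` is assumed to already hold the result of step (1), i.e., the minimal local error of processed frame j when its matching index is i.

```python
def register_temporal(cost):
    """Find the non-decreasing state sequence with the smallest accumulated cost.

    cost[j][i] is assumed to be the minimal local prediction error of
    processed frame j at matching index i (step 1, computed elsewhere).
    Returns (total_error, alpha), where alpha[j] is the chosen matching index.
    """
    n_stages, n_states = len(cost), len(cost[0])
    delta = [list(cost[0])]      # accumulated error per node, stage 0
    back = [[None] * n_states]   # best predecessor state per node
    for j in range(1, n_stages):
        row, ptr = [], []
        best_i, best = 0, delta[j - 1][0]
        for i in range(n_states):
            # Running minimum over predecessors i' <= i (causal constraint).
            if delta[j - 1][i] < best:
                best_i, best = i, delta[j - 1][i]
            row.append(cost[j][i] + best)
            ptr.append(best_i)
        delta.append(row)
        back.append(ptr)
    # Step 3: back trace from the cheapest state of the last stage.
    i = min(range(n_states), key=lambda s: delta[-1][s])
    total = delta[-1][i]
    alpha = [i]
    for j in range(n_stages - 1, 0, -1):
        i = back[j][i]
        alpha.append(i)
    alpha.reverse()
    return total, alpha
```

Keeping a running minimum over the previous stage makes each stage linear in the number of states, because the only coupling between stages in Equation (2) is the ordering of the matching indices.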

Temporal Registration for Camcorder Capture of Feature Films 

The previous section proposed a general model for registering processed video. This section 
applies the general model to temporal registration of, e.g., camcorder-captured movies. 

There are three main causes of the temporal misalignment between an original movie and a 
captured video of that movie. First, the original movie generally displays at a different frame rate from the 
one used for capturing. For example, movies are usually played at 24 frames per second (fps), while a 
camcorder records a video at 30 fps or 60 fields per second. In addition, there is typically an initial offset 
between when a movie is displayed and when the first frame is captured. Finally, the frame rates for 
displaying and capturing may drift. This is particularly the case when projecting and/or capturing using 
film, where the film is transferred from one roller to the other, and changes in the effective diameter of 
each roller can cause a slight change in the frame rate. Because of these three factors, each processed 
video frame generally does not correspond to a single frame of the original sequence. For example, when
displaying at 24 fps and capturing at 30 fps, most of the processed frames are the weighted average of two
adjacent frames in the original sequence, because each of the two adjacent frames is displayed during a
different portion of the video camera's integration time for the corresponding processed frame.

Fig. 3 graphically illustrates temporal misalignment resulting from differences in both frame rate
and starting time between original and processed video sequences. As a result of these differences,
processed Frame 1 of Fig. 3, for example, will be based on both original Frame 1 and original Frame 2.
When the frame rate of capture is greater than or equal to 1/2 the frame rate of display, a processed frame
$J_j$ can be modeled as a linear combination of two consecutive original video frames $I_i$ and $I_{i-1}$
according to the temporal mapping function of Equation (5) as follows:

$$J_j = \varphi(I_i, I_{i-1}; \lambda_{ji}) = \lambda_{ji} \cdot I_i + (1 - \lambda_{ji}) \cdot I_{i-1}, \qquad (5)$$

where $\lambda_{ji}$ is the fraction of exposure of original frame $I_i$ in processed frame $J_j$, and
$\lambda_{ji}$ is greater than 0 and less than or equal to 1.
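To illustrate how such exposure fractions can arise, the sketch below computes, for an idealized capture, how much of each capture interval is spent on each displayed frame. It assumes a camera that integrates over its whole frame period and a display that switches frames instantaneously; the function names `exposure_weights` and `to_alpha_lambda`, and the `offset` parameter, are hypothetical.

```python
import math
from fractions import Fraction


def exposure_weights(j, display_fps=24, capture_fps=30, offset=Fraction(0)):
    """Return {original frame index: fraction of capture interval j spent on it}.

    Assumes capture interval j covers [offset + j/capture_fps,
    offset + (j+1)/capture_fps) and original frame i is on screen during
    [i/display_fps, (i+1)/display_fps).
    """
    start = offset + Fraction(j, capture_fps)
    end = start + Fraction(1, capture_fps)
    weights = {}
    i = math.floor(start * display_fps)  # frame on screen when exposure starts
    while Fraction(i, display_fps) < end:
        shown_from = max(start, Fraction(i, display_fps))
        shown_to = min(end, Fraction(i + 1, display_fps))
        weights[i] = (shown_to - shown_from) * capture_fps
        i += 1
    return weights


def to_alpha_lambda(weights):
    """Matching index (largest original index) and its exposure fraction."""
    alpha = max(weights)
    return alpha, weights[alpha]
```

For the 24 fps to 30 fps case with no offset, captured frame 1 spends one quarter of its interval on original frame 0 and three quarters on original frame 1, so in the notation of Equation (5) its matching index is 1 and its weight is 3/4.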

Because there is no explicit relationship between $i$ (the original video frame number) and $j$ (the
corresponding processed video frame number), the temporal mapping function of Equation (5) can be used
to model not only frame integration, but also frame dropping and frame repeating, which are widely used
temporal manipulations of digital video. In particular, frame dropping can be modeled by increasing $i$ by
more than one for two consecutive processed frames. Similarly, frame repeating can be modeled by
keeping $i$ constant for two or more consecutive processed frames.

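Given the definition of the temporal mapping operation, the minimization of Equation (2) can be
rewritten as Equation (6) as follows:

$$[\alpha_0^*, \lambda_0^*, \ldots, \alpha_M^*, \lambda_M^*] = \underset{\alpha_0 \le \cdots \le \alpha_M,\ \lambda_0, \ldots, \lambda_M}{\arg\min} \sum_{j=0}^{M} \left\| J_j - \varphi\left[I_{\alpha_j}, I_{\alpha_j - 1}; \lambda_j\right] \right\|^2 \qquad (6)$$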

Minimization of Local Prediction Error

The dynamic programming described previously involves minimizing the local prediction error
defined in Equation (4). With the temporal mapping function defined in Equation (5), the goal is to
minimize the matching error between the current processed frame $J_j$ and the temporally mapped frame
that approximates the processed frame, as represented by Equation (7) as follows:

$$\delta(i,j) = \min_{k(j)} \left\| J_j - \varphi_{k(j)}\left[I_{i-\beta(k(j))+1}, \ldots, I_i\right] \right\|^2 = \min_{\lambda_{ji}} \left\| J_j - \varphi\left[I_i, I_{i-1}; \lambda_{ji}\right] \right\|^2 = \min_{\lambda_{ji}} \left\| J_j - \left(\lambda_{ji} I_i + (1 - \lambda_{ji}) I_{i-1}\right) \right\|^2 \qquad (7)$$

over $\lambda_{ji}$ for any given processed frame $J_j$ and original frames $I_i$ and $I_{i-1}$.

The mean squared error $\varepsilon(P, Q)$ between two frames $P$ and $Q$ can be given by Equation (8) as
follows:

$$\varepsilon(P, Q) = \frac{1}{HW} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \left(P_{h,w} - Q_{h,w}\right)^2, \qquad (8)$$

where $P_{h,w}$ and $Q_{h,w}$ are pixels in frames $P$ and $Q$, respectively, and $H$ and $W$ are the height and
width of each frame. In the context of the present invention, frame $P$ may be the current processed frame
$J_j$, and frame $Q$ may be the corresponding "temporally mapped" frame generated using the linear
combination of original frames $I_i$ and $I_{i-1}$ as defined in Equation (5).

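In addition, the "cross correlation" $\varepsilon(P; R, Q)$ between the differences $(P_{h,w} - Q_{h,w})$ and
$(P_{h,w} - R_{h,w})$ can be defined according to Equation (9) as follows:

$$\varepsilon(P; R, Q) = \frac{1}{HW} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \left(P_{h,w} - Q_{h,w}\right)\left(P_{h,w} - R_{h,w}\right), \qquad (9)$$

where frame $P$ is the current processed frame, and frames $Q$ and $R$ are two possible temporally mapped
consecutive frames in the original sequence corresponding to frame $P$. Then, the value of $\lambda_{ji}$ that
minimizes Equation (7) (i.e., $\lambda_{ji}^*$) can be computed in closed form according to Equation (10) as
follows:

$$\lambda_{ji}^* = \max\left(0, \min\left(1, \frac{\varepsilon(J_j, I_{i-1}) - \varepsilon(J_j; I_i, I_{i-1})}{\varepsilon(J_j, I_i) - 2\varepsilon(J_j; I_i, I_{i-1}) + \varepsilon(J_j, I_{i-1})}\right)\right), \qquad (10)$$

and the minimal MSE can be represented according to Equation (11) as follows:

$$\delta(i,j) = \min_{\lambda_{ji}} \left\| J_j - \left(\lambda_{ji} I_i + (1 - \lambda_{ji}) I_{i-1}\right) \right\|^2 = \frac{\varepsilon(J_j, I_i)\,\varepsilon(J_j, I_{i-1}) - \varepsilon^2(J_j; I_i, I_{i-1})}{\varepsilon(J_j, I_i) - 2\varepsilon(J_j; I_i, I_{i-1}) + \varepsilon(J_j, I_{i-1})}. \qquad (11)$$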
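The closed-form weight can be checked numerically on small frames. The sketch below is only illustrative: frames are assumed to be flat lists of pixel values of equal length, the function names are hypothetical, and the fallback for two identical original frames (where the denominator vanishes) is an assumption that the text does not address.

```python
def mse(p, q):
    """Mean squared error between two frames given as flat pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def cross(p, r, q):
    """Mean over all pixels of (P - Q) * (P - R)."""
    return sum((a - c) * (a - b) for a, b, c in zip(p, r, q)) / len(p)


def best_lambda(j_frame, i_cur, i_prev):
    """Clamped closed-form weight of i_cur when blending i_cur and i_prev."""
    e_cur, e_prev = mse(j_frame, i_cur), mse(j_frame, i_prev)
    x = cross(j_frame, i_cur, i_prev)
    denom = e_cur - 2 * x + e_prev   # equals mse(i_cur, i_prev)
    if denom == 0:                   # assumed fallback: identical originals
        return 1.0
    return max(0.0, min(1.0, (e_prev - x) / denom))


def unclamped_min_error(j_frame, i_cur, i_prev):
    """Minimal error of the blend when the clamp to [0, 1] is inactive."""
    e_cur, e_prev = mse(j_frame, i_cur), mse(j_frame, i_prev)
    x = cross(j_frame, i_cur, i_prev)
    return (e_cur * e_prev - x * x) / (e_cur - 2 * x + e_prev)


def blend_error(j_frame, i_cur, i_prev, lam):
    """Error of the blended prediction lam * i_cur + (1 - lam) * i_prev."""
    blended = [lam * a + (1 - lam) * b for a, b in zip(i_cur, i_prev)]
    return mse(j_frame, blended)
```

For a processed frame synthesized as one quarter of the current original frame plus three quarters of the previous one, `best_lambda` recovers 0.25 and both error functions return zero. When the clamp is active, the error should be evaluated with `blend_error` at the clamped weight rather than with the unclamped expression.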



Contextual Constraints 

Video registration is an ill-posed inverse problem. For a given original video and a processed
video, there may exist more than one solution. However, due to the nature of the prior knowledge of the
application, the solutions to the same problem may have significantly different probabilities. For example,
frame repeating and frame dropping are usually used infrequently, and they seldom apply to two or more
consecutive frames. For example, when there are a large number of consecutive similar frames, they are
more likely from a scene of little motion than caused by consecutive uses of frame repeating.

Contextual constraints are derived from the prior knowledge of what a solution must satisfy. The
use of contextual constraints can reduce the solution space, improve the accuracy, and increase the
robustness against noise. One contextual constraint already used is the causal constraint on matching
indices (e.g., a feasible path must be a non-decreasing path). However, not all monotonically increasing
paths among the state space are allowed. For example, when a change of frame occurs during the capture,
$0 < \lambda < 1$, if $\alpha_j = \alpha_{j+1}$, then either $\lambda_j = \lambda_{j+1}$ (which represents frame
repeat), or $0 < \lambda_j < \lambda_{j+1} = 1$. That is, if two consecutive processed frames correspond to the
same set of original frames, then the second processed frame must be based entirely on the later original frame
($\lambda_{j+1} = 1$), except when the second of the two consecutive processed frames is a repeat of the first
of the two. In other words, if the first of two consecutive processed frames results from the integration of two
original frames, then the second processed frame cannot be based on an integration of those same two original
frames, since the display of the first original frame ended during the integration of the first processed frame.
Other results, such as $0 < \lambda_j < \lambda_{j+1} < 1$ or $\lambda_j > \lambda_{j+1}$, are invalid.

Instead of being listed as a set of rules, the contextual constraints can be incorporated into the
optimization process in the form of a state transition cost
$C(\varphi_{k(j)}, \alpha_j; \varphi_{k(j-1)}, \alpha_{j-1})$, which is a function of the mapping functions and the
matching indices of both the current and the previous distorted frames. Therefore, the cost function used in
Equation (2) becomes:

$$\sum_{j=0}^{M} \left\{ \left\| J_j - \varphi_{k(j)}\left[I_{\alpha_j-\beta(k(j))+1}, \ldots, I_{\alpha_j}\right] \right\|^2 + C(\varphi_{k(j)}, \alpha_j; \varphi_{k(j-1)}, \alpha_{j-1}) \right\}, \qquad (12)$$

where the first term is called the matching term and the second term is called the context term.

Fig. 4 shows a plot of one possible state transition cost $C^c(\alpha_j; \alpha_{j-1})$ as a function of the
difference between the matching indices $\alpha_{j-1}$ and $\alpha_j$ for two consecutive processed frames,
which function enforces the following contextual constraints:

(1) $C^c(\alpha_j; \alpha_{j-1})$ is set to the maximum distortion level when $\alpha_j < \alpha_{j-1}$ to enforce
the causal constraint.

(2) $C^c(\alpha_j; \alpha_{j-1})$ is 0 only when $\alpha_j = \alpha_{j-1} + 1$. This encourages (but does not
require) the assumption that the matching index will increment by one from frame to frame in the processed
video sequence. In other words, frame dropping, where the matching index increments by more than one, is
penalized.

(3) $C^c(\alpha_j; \alpha_{j-1})$ is assigned a positive cost when $\alpha_j = \alpha_{j-1}$. This penalizes frame
repeating. Therefore, frame repeating will be selected only if it reduces the matching error significantly.
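A transition cost with the three properties listed above might look like the following sketch. The actual shape and values plotted in Fig. 4 are not recoverable from the text, so the penalty values, the linear growth for dropped frames, and the use of infinity as the "maximum distortion level" are all assumptions, as are the names `transition_cost`, `repeat_penalty`, and `drop_penalty`.

```python
MAX_DISTORTION = float("inf")  # assumed stand-in for the maximum distortion level


def transition_cost(alpha_cur, alpha_prev, repeat_penalty=1.0, drop_penalty=0.5):
    """Context cost for moving from matching index alpha_prev to alpha_cur."""
    step = alpha_cur - alpha_prev
    if step < 0:
        return MAX_DISTORTION             # (1) causal constraint violated
    if step == 0:
        return repeat_penalty             # (3) frame repeating is penalized
    if step == 1:
        return 0.0                        # (2) normal advance by one frame
    return drop_penalty * (step - 1)      # dropped frames: assumed linear penalty
```

Adding this cost to the local matching error at each transition of the dynamic program biases the solution toward steady one-frame advances without forbidding repeats or drops.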

Integration of Spatio-Temporal and Histogram Registration of Processed Video 

In addition to temporal distortion, processed video also suffers from spatial misalignment and
global changes of intensity (also referred to as histogram misalignment), such as gamma changes. The
general model previously described for processed video effectively isolates the estimation of the matching
indices from the optimization of the mapping function. Since spatial and intensity (i.e., histogram)
registration can be incorporated into the mapping operation $\varphi_{k(j)}$, temporal registration can be
performed together with either spatial registration or histogram registration or both, all in the same
optimization process using dynamic programming.

Model of Histogram Misalignment 

Modern camcorders are sophisticated video recording systems. In order to produce visually 
pleasing videos, a set of algorithms is typically used to improve the appearance of a processed video. 
Some of these algorithms will alter the RGB histograms of the processed video. The most important one is 
Automatic Gain Control (AGC). Depending on the average luminance, AGC will adjust the sensitivity of 
the camcorder by applying a gain to the processed video signal. In addition, White Balance, which 
determines what color is white, also affects the histogram. Finally, the end-to-end gamma from display to 
capture might not be unity. When the forensic watermark only modifies the luminance component, only
the histogram of the luminance needs to be corrected.

Histogram transformation, also known as histogram shaping, maps one histogram to another.
Histogram transformation can be represented using a table look-up. For example, for each frame, the
transformation may have 256 parameters that map gray levels in a processed video frame to those in the
original frame.

Let $H_r(u)$ and $H_c(v)$ be the normalized histograms of a reference video frame (which may in
turn correspond to the linear combination of two video frames from the original video sequence) and the
corresponding processed video frame, respectively. That is, $H_r(u)$ is the percentage of pixels that have
gray level $u$ in the reference frame, and $H_c(v)$ is the percentage of pixels that have gray level $v$ in the
processed frame. Then, the histogram mapping that maps the histogram of the processed video frame to
the histogram of the reference frame, $u = \varphi_h(v)$, can be computed as follows. For a given $v$, let
$u_0(v)$ be the largest $u$ such that $\sum_{u=0}^{u_0(v)} H_r(u) \le \sum_{u=0}^{v} H_c(u)$, and $u_1(v)$ be the
smallest $u$ such that $\sum_{u=0}^{u_1(v)} H_r(u) \ge \sum_{u=0}^{v} H_c(u)$. Then,

$$\varphi_h(v) = \sum_{u=u_0(v)+1}^{u_1(v)} u \cdot H_r(u) \Bigg/ \sum_{u=u_0(v)+1}^{u_1(v)} H_r(u).$$

Further information on histogram misalignment, registration, and correction may be found in A.K. 
Jain, Fundamentals of Digital Image Processing, Prentice Hall, 1989, the teachings of which are 
incorporated herein by reference. 
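As a rough illustration of a table look-up histogram transformation, the sketch below builds a mapping from processed gray levels to reference gray levels by matching cumulative histograms. It is a simplified form of histogram specification that maps each level to the smallest reference level whose cumulative histogram reaches that of the processed frame; it does not reproduce any interpolation between levels. The function name `histogram_mapping` is hypothetical, and histograms are assumed to be lists of non-negative integer pixel counts with at least one pixel each.

```python
from itertools import accumulate


def histogram_mapping(counts_ref, counts_proc):
    """Look-up table mapping each processed gray level v to a reference level.

    counts_ref[u] and counts_proc[v] are assumed to be integer pixel counts
    per gray level. Cumulative sums are compared by cross-multiplication so
    that frames of different sizes can be matched without rounding.
    """
    cum_r = list(accumulate(counts_ref))
    cum_c = list(accumulate(counts_proc))
    total_r, total_c = cum_r[-1], cum_c[-1]
    table = []
    u = 0
    for v in range(len(counts_proc)):
        # Advance to the smallest reference level whose cumulative share
        # is at least the cumulative share of the processed frame at v.
        while cum_r[u] * total_c < cum_c[v] * total_r:
            u += 1
        table.append(u)
    return table
```

For example, if the reference frame uses only gray levels 2 and 3 while the processed frame uses only levels 0 and 1 in the same proportions, the table maps level 0 to 2 and level 1 to 3.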

Model of Spatial Misalignment

Many factors can lead to spatial distortion in processed video. First, the camcorder is generally not
placed on the optical axis of the projector. In addition, each capture may use different zoom or crop
parameters. Finally, the camcorder might not be completely stationary during the capturing process.
Therefore, a processed video is usually a perspective projection of the original video.

Let $(x, y, z)$ be a point on the camcorder's imaging plane and $(X, Y, Z)$ be a point on the screen
on which the video is projected. Then, the perspective transform from $(X, Y, Z)$ to $(x, y, z)$ can be
expressed as Equation (13) as follows:

(13)

where $P$ is the perspective transformation matrix. When the rotation factor can be ignored, $m_{xx}$,
$m_{yy}$, and $m_{zz}$ are the scaling factors, and $m_x$, $m_y$, and $m_z$ are the translations in the X, Y, and Z
directions, respectively. In some applications, it may be sufficient to assume that the transformation is plane
perspective (also known as the homography transformation), thereby eliminating the dependency on the
depth and reducing the number of independent parameters to 8.
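The eight-parameter plane perspective case can be illustrated with a short sketch that applies a 3x3 homography to a point. The function name `apply_homography` and the convention of fixing the bottom-right matrix entry to 1 (leaving eight free parameters) are assumptions made for this example.

```python
def apply_homography(h, x, y):
    """Map point (x, y) through a 3x3 homography given as nested lists.

    h[2][2] is assumed to be fixed to 1, which leaves eight free parameters.
    """
    xs = h[0][0] * x + h[0][1] * y + h[0][2]
    ys = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xs / w, ys / w
```

A matrix with only diagonal scale entries and a translation column reproduces pure scaling and shifting, while non-zero entries in the first two positions of the bottom row introduce the perspective foreshortening that occurs when the camcorder is off the optical axis.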

Further information on spatial misalignment, registration, and correction may be found in R.
Kumar, H.S. Sawhney, J.C. Asmuth, A. Pope, and S. Hsu, "Registration of video to geo-referenced
imagery," Proc. ICPR 98, Brisbane, Australia, Aug. 1998, and in S. Baudry, P. Nguyen, H. Maitre,
"Estimation of geometric distortions in digital watermarking," Proc. of IEEE Int'l Conf. on Image
Processing, Rochester, NY, Sept. 2002, the teachings of both of which are incorporated herein by
reference.

Combined Frame-Level Video Processing Model 

To register processed video sequences not only temporally, but also spatially and intensity-wise, 
the model can be expanded by assuming that the same spatial and histogram transformations are applied to 
both original frames used to form each processed frame. In that case, the mapping operation defined in 
Equation (5) can be represented by Equation (14) as follows:

$$J_j = \varphi(I_i, I_{i-1}; \lambda_{ji}, \psi, \rho) = \lambda_{ji} \cdot \rho(\psi(I_i)) + (1 - \lambda_{ji}) \cdot \rho(\psi(I_{i-1})), \qquad (14)$$

where $\psi(\cdot)$ is the spatial transformation and $\rho(\cdot)$ is the histogram transformation. Although the
composition of Equation (14) depends on the order of how the three transformations are combined, the
differences among different compositions are small and can be ignored.

The minimization of the local prediction error for each node previously proposed also needs to include
spatial registration and intensity registration (e.g., histogram shaping). However, since spatial
misalignment and global changes of the histogram are inter-related with the weighted summation of
consecutive frames modeled in Equation (5), no closed form solution is available. As such, ICM (Iterated
Conditional Modes) processing can be used to minimize the local prediction error for each node. The basic
idea of ICM is to fix a set of parameters and optimize on the rest. In the present case, ICM is implemented by
fixing two of the three sets of parameters and optimizing the third. For one possible implementation, at
each iteration, ICM first performs spatial registration, followed by histogram shaping, and then temporal
registration by computation of $\lambda_{ji}$ using Equations (10) and (11). The procedure will iterate until it
converges (i.e., when the changes in spatial, histogram, and temporal parameters are less than specified
thresholds) or until a specified maximum number of iterations is reached. Using ICM, the optimization of
local prediction becomes three optimizations: optimization of temporal frame integration, spatial
registration, and histogram registration.

Histogram Registration Constraint

If there are no contextual constraints imposed, the histogram registration can fail in some special
cases. For example, when an original video contains a black frame, any frame in the processed sequence
can be mapped to the black frame by histogram registration with an MSE of 0, which obviously is not a
desired solution. To address this situation, a contextual constraint can be defined for histogram registration
that is proportional to the mutual information between the colors of the processed sequence and the colors
of the original sequence according to Equation (15) as follows:

$$C(I_i, J_j) = -w\left(H(I_i) + H(J_j) - H(I_i, J_j)\right), \qquad (15)$$

where $w$ is a scalar weight, $H(I_i)$ is the entropy of the probability computed from the normalized
histogram of frame $I_i$, and $H(I_i, J_j)$ is the corresponding joint entropy of frames $I_i$ and $J_j$. When
frame $I_i$ is a black frame or when frames $I_i$ and $J_j$ are independent, the contextual cost is 0. When
frames $I_i$ and $J_j$ have a deterministic relationship, the contextual cost is $-w$ times the
entropy of the normalized histogram of frame $I_i$ or frame $J_j$.
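The two limiting cases described above can be verified with a small sketch of the mutual-information cost. It assumes that the joint histogram is built from co-located pixel values of two frames given as flat lists of equal length; the function names `entropy` and `histogram_context_cost` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Entropy in bits of the distribution given by a Counter of occurrences."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def histogram_context_cost(frame_i, frame_j, weight=1.0):
    """Negative weighted mutual information between two frames.

    frame_i and frame_j are assumed to be flat lists of pixel values of
    equal length; the joint histogram pairs co-located pixels.
    """
    h_i = entropy(Counter(frame_i))
    h_j = entropy(Counter(frame_j))
    h_ij = entropy(Counter(zip(frame_i, frame_j)))
    return -weight * (h_i + h_j - h_ij)
```

A black original frame carries no information about the processed frame, so the cost is zero and mapping to it earns no reward, whereas a one-to-one relationship between gray levels earns the largest reward (the most negative cost).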

Global Matching Constraint 

There is also a global constraint for the temporal registration that is useful to register the processed
video sequence to the original one. For processed video, the effective length of the processed video should
be equal to the effective length of the displayed video, where the effective length is the length between the
first and the last frames of a video sequence. Therefore, the temporal correspondence between the
processed frames and the original frames is relatively fixed. This constraint can be implemented as a
global contextual constraint according to Equation (16) as follows:



C(a{j))=w a 



(16) 



where $r_o$ and $r_c$ are the frame rates of the original and processed sequences, respectively, and $w_a$ is a
scalar weight.

Possible Implementation

Fig. 5 shows a flow diagram of a solution to the registration problem using dynamic programming,
according to one possible embodiment of the present invention. According to this embodiment, the
processed video sequence is analyzed frame-by-frame from the first frame 0 to the last frame M (steps 502
and 508). Those skilled in the art will appreciate that, in other embodiments, the analysis can be organized
differently.

For each processed frame, a minimized local prediction error is generated for each of one or more
different sets of one or more original frames that could correspond to the current processed frame (step
504). For example, applying the global matching constraint of Equation (16) to the temporal mapping
function of Equation (5) implies that there are only a limited number of different corresponding pairs of
original frames that could be used to generate any given processed frame. Step 504 involves the
generation of a different minimized local prediction error (e.g., using Equation (7)) for each of these
different possible pairs of original frames. Depending on the particular implementation, the minimized
local prediction error of step 504 could be generated using any combination of temporal, spatial, and
histogram registration. As described previously, when temporal registration is combined with spatial
and/or histogram registration, ICM processing can be used to generate the minimized local prediction error
for each different mapping of original frames to the current processed frame.
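
One way the minimized local prediction error of step 504 might be computed for a single candidate pair of original frames, under temporal registration only, is sketched below in Python. The sketch assumes that the processed frame is modeled as a linear blend alpha * I_a + (1 - alpha) * I_b of the two original frames and that the error measure is the MSE; Equations (5) and (7) may differ in detail. The function name `min_local_prediction_error` and the flat-list frame representation are hypothetical.

```python
def min_local_prediction_error(frame_a, frame_b, processed):
    """Return (alpha, mse) for the blend alpha*a + (1-alpha)*b closest to processed.

    Assumption: alpha is constrained to [0, 1]. Because the MSE is a convex
    quadratic in alpha, clamping the unconstrained least-squares solution
    gives the constrained minimum.
    """
    n = len(processed)
    if n == 0 or len(frame_a) != n or len(frame_b) != n:
        raise ValueError("frames must be non-empty and of equal size")
    # Rewrite the residual as (processed - b) - alpha * (a - b).
    d = [a - b for a, b in zip(frame_a, frame_b)]
    e = [p - b for p, b in zip(processed, frame_b)]
    dd = sum(x * x for x in d)
    # If both original frames are identical, alpha has no effect; use 0.
    alpha = 0.0 if dd == 0 else sum(x * y for x, y in zip(d, e)) / dd
    alpha = min(1.0, max(0.0, alpha))
    mse = sum((y - alpha * x) ** 2 for x, y in zip(d, e)) / n
    return alpha, mse
```

Repeating this computation for each candidate pair permitted by the global matching constraint would give one minimized local prediction error per candidate mapping. Combining it with spatial and/or histogram registration would require an iterative scheme such as the ICM processing mentioned above, which this sketch does not attempt.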

For each mapping of original frames to the current processed frame, the path (as in Fig. 2)
corresponding to the smallest accumulated prediction error is selected (step 506). As indicated in Fig. 2,
each processed frame can be reached via a finite number of paths corresponding to the previous processed
frame, each of which will have an accumulated prediction error associated with it. For each mapping
identified in step 504, step 506 involves the selection (and extension to the current processed frame) of the
path that minimizes the accumulated prediction error.
After the entire processed video sequence has been processed, there will be a number of different
possible paths identified that map the original video sequence to the processed video sequence. The path
having the smallest accumulated prediction error is selected as the optimal mapping of the original
sequence to the processed sequence, where the path can be traced back to identify the frame-by-frame
mapping between the original sequence and the processed sequence (step 510). In this way, the entire
processed video sequence can be registered with respect to the original video sequence at the frame level.
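
The dynamic-programming flow of steps 502 to 510 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the embodiment of Fig. 5 itself: each candidate mapping is reduced to a single original-frame index, the allowed transitions (the original index never decreases and advances by at most `max_step` per processed frame) are hypothetical stand-ins for the path structure of Fig. 2, and `local_error[j][k]` is assumed to already hold the minimized local prediction error of step 504.

```python
def register_frames(local_error, max_step=2):
    """Return (total_error, mapping) minimizing the accumulated prediction error.

    local_error[j][k]: minimized local prediction error when processed frame j
    is mapped to original frame k (use float('inf') for disallowed mappings).
    mapping[j] is the original-frame index selected for processed frame j.
    """
    if not local_error or not local_error[0]:
        raise ValueError("local_error must be a non-empty table")
    num_orig = len(local_error[0])
    acc = list(local_error[0])   # accumulated error of the best path to each k
    back = []                    # back[j-1][k]: predecessor of k at frame j
    for row in local_error[1:]:  # loop over processed frames (steps 502/508)
        new_acc, pointers = [], []
        for k in range(num_orig):
            lo = max(0, k - max_step)
            # Step 506: pick the incoming path with the smallest accumulated error.
            prev = min(range(lo, k + 1), key=lambda q: acc[q])
            new_acc.append(acc[prev] + row[k])
            pointers.append(prev)
        acc = new_acc
        back.append(pointers)
    # Step 510: choose the best final path and trace it back to frame 0.
    k = min(range(num_orig), key=lambda q: acc[q])
    total = acc[k]
    mapping = [k]
    for pointers in reversed(back):
        k = pointers[k]
        mapping.append(k)
    mapping.reverse()
    return total, mapping
```

For example, with `local_error = [[0, 5, 5, 5], [5, 0, 5, 5], [5, 5, 5, 0]]` and the default `max_step`, the sketch selects the mapping `[0, 1, 3]` with a total error of 0, which corresponds to original frame 2 being skipped in the processed sequence.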

Although the present invention has been described in the context of a video frame as a single
entity, those skilled in the art will understand that the invention can also be applied in the context of
interlaced video streams and associated field processing. As such, unless clearly inappropriate for the
particular implementation described, the term "frame," especially as used in the claims, should be
interpreted to cover applications for both video frames and video fields.

Moreover, the present invention has been described with local prediction error being based at the frame
level. Those skilled in the art will understand that, in certain implementations, local prediction error could
also be based at the level of sets of two or more frames, but fewer than the entire video sequence. In
that sense, the notion of local prediction error could be thought of as being at the sub-sequence level,
where each video sub-sequence corresponds to one or more (but not all) of the frames in the video
sequence, and where the term "frame" could refer to either a video frame or a video field.

The present invention may be implemented as circuit-based processes, including possible
implementation as a single integrated circuit (such as an ASIC or an FPGA), a multi-chip module, a single
card, or a multi-card circuit pack. As would be apparent to one skilled in the art, various functions of
circuit elements may also be implemented as processing steps in a software program. Such software may
be employed in, for example, a digital signal processor, micro-controller, or general-purpose computer.

The present invention can be embodied in the form of methods and apparatuses for practicing
those methods. The present invention can also be embodied in the form of program code embodied in
tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage
medium, wherein, when the program code is loaded into and executed by a machine, such as a computer,
the machine becomes an apparatus for practicing the invention. The present invention can also be
embodied in the form of program code, for example, whether stored in a storage medium, loaded into
and/or executed by a machine, or transmitted over some transmission medium or carrier, such as over
electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the
program code is loaded into and executed by a machine, such as a computer, the machine becomes an
apparatus for practicing the invention. When implemented on a general-purpose processor, the program
code segments combine with the processor to provide a unique device that operates analogously to specific
logic circuits.

It will be further understood that various changes in the details, materials, and arrangements of the 
parts which have been described and illustrated in order to explain the nature of this invention may be 
made by those skilled in the art without departing from the principle and scope of the invention as 
expressed in the following claims. 

Although the steps in the following method claims, if any, are recited in a particular sequence with 
corresponding labeling, unless the claim recitations otherwise imply a particular sequence for 
implementing some or all of those steps, those steps are not necessarily intended to be limited to being 
implemented in that particular sequence. 



